Journal of Systems Engineering and Electronics, 2011, Vol. 33, Issue (1): 196-201. doi: 10.3969/j.issn.1001-506X.2011.01.40

• Software, Algorithms and Simulation •

Study on source of classification in imbalanced datasets based on new ensemble classifier

ZHAI Yun1,2, YANG Bing-ru1, QU Wu1, SUI Hai-feng1

  1. School of Information Engineering, University of Science and Technology Beijing, Beijing 100083, China;
  2. College of Computer Science, Liaocheng University, Liaocheng, Shandong 252059, China
  • Online: 2011-01-20 Published: 2010-01-03

Abstract:

To address classification on imbalanced datasets, this paper presents a resampling algorithm based on differentiated sampling rates (differentiated sampling rate algorithm, DSRA) and, building on DSRA, proposes a new ensemble classifier, the SVM-Ripper ensemble classifier (SREC). SREC combines a distinctive classifier selection strategy, classifier integration strategy, and classification decision scheme, and thereby achieves relatively high classification accuracy. SREC is then used to study the key factors that affect the classification of imbalanced data. The simulation results indicate that the difficulty of imbalanced classification arises from several factors acting together: imbalance between the positive and negative classes, imbalance within a class, sample size, and the degree of imbalance. Only by considering these factors together can the imbalanced classification problem be better addressed.
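The abstract does not give the sampling rule used by DSRA, so the following is only a minimal sketch of the general idea of resampling each class at its own rate, together with one common way to quantify the imbalance degree (majority count divided by minority count). The function names `imbalance_ratio` and `resample_with_class_rates`, the toy labels, and the rates 0.5 and 3.0 are all hypothetical choices made for illustration and are not taken from the paper.

```python
import random
from collections import Counter, defaultdict


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count (assumed definition)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def resample_with_class_rates(labels, rates, seed=0):
    """Return sample indices after applying a separate sampling rate per class.

    rates[c] < 1 undersamples class c without replacement;
    rates[c] > 1 keeps every sample of class c and adds random duplicates.
    Classes missing from `rates` are kept unchanged (rate 1.0).
    This is an illustrative scheme, not the DSRA procedure itself.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    selected = []
    for y, idx in by_class.items():
        target = round(rates.get(y, 1.0) * len(idx))
        if target <= len(idx):
            # Undersample: draw `target` distinct indices.
            selected.extend(rng.sample(idx, target))
        else:
            # Oversample: keep all originals, then draw extra duplicates.
            selected.extend(idx)
            selected.extend(rng.choices(idx, k=target - len(idx)))
    return sorted(selected)


# Toy data: 90 negative samples (label 0) and 10 positive samples (label 1).
labels = [0] * 90 + [1] * 10
picked = resample_with_class_rates(labels, {0: 0.5, 1: 3.0})
resampled = [labels[i] for i in picked]
print(imbalance_ratio(labels), imbalance_ratio(resampled))  # 9.0 1.5
```

In this toy run the negative class shrinks from 90 to 45 samples and the positive class grows from 10 to 30, so the between-class imbalance drops from 9.0 to 1.5. Random duplication does nothing about within-class imbalance or small sample size, which is consistent with the abstract's point that the factors need to be considered together.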